Accurate assembly of multiple RNA-seq samples with Aletsch

Abstract Motivation High-throughput RNA sequencing has become indispensable for decoding gene activities, yet the challenge of reconstructing full-length transcripts persists. Traditional single-sample assemblers frequently produce fragmented transcripts, especially in single-cell RNA-seq data. While algorithms designed for assembling multiple samples exist, they encounter various limitations. Results We present Aletsch, a new assembler for multiple bulk or single-cell RNA-seq samples. Aletsch incorporates several algorithmic innovations, including a “bridging” system that can effectively integrate multiple samples to restore missed junctions in individual samples, and a new graph-decomposition algorithm that leverages “supporting” information across multiple samples to guide the decomposition of complex vertices. A standout feature of Aletsch is its application of a random forest model with 50 well-designed features for scoring transcripts. We demonstrate its robust adaptability across different chromosomes, datasets, and species. Our experiments, conducted on RNA-seq data from several protocols, firmly demonstrate Aletsch’s significant outperformance over existing meta-assemblers. As an example, when measured with the partial area under the precision-recall curve (pAUC, constrained by precision), Aletsch surpasses the leading assemblers TransMeta by 22.9%–62.1% and PsiCLASS by 23.0%–175.5% on human datasets. Availability and implementation Aletsch is freely available at https://github.com/Shao-Group/aletsch. Scripts that reproduce the experimental results of this manuscript is available at https://github.com/Shao-Group/aletsch-test.


Introduction
The well-established high-throughput RNA sequencing (RNAseq) technologies have been pivotal and instrumental in many biological and biomedical studies.The recent advancement in singlecell RNA-seq technologies further enables profiling gene activities at single-cell resolutions.One of the critical steps in RNA-seq analysis is the determination of full-length expressed transcripts from the RNA-seq reads, a computational problem commonly known as the transcript assembly.Over the decades, significant endeavors have been dedicated to this problem, including studying the mathematical formulations behind (Shao and Kingsford 2019;Dias et al. 2022;Khan et al. 2022) and developing practical tools, including Cufflinks (Trapnell et al. 2010), CLASS2 (Song et al. 2016), TransComb (Liu et al. 2016), StringTie series (Pertea et al. 2015;Kovaka et al. 2019), and Scallop series (Shao and Kingsford 2017;Zhang et al. 2022), to name a few.
Despite these efforts, transcripts assembled from a single sample or cell often remains incomplete or fragmented.One scenario is that if a junction were not sequenced, then there is little chance for a single-sample assembler to correctly recover it.This motivates meta-assembly-assembling full-length transcripts from multiple RNA-seq samples.Meta-assembly provides a promising opportunity to take advantage of shared information across various samples, to fill in the gaps, particularly the missing exons and junctions within individual samples, and to contribute to a more accurate and complete assembly.A trivial meta-assembly approach is a simple integration of RNA-seq reads from individual samples followed by calling a single-sample assembler; this method not only ignores the divergence of individual samples but also significantly increases the complexity of resolving the splicing variants, and hence performs poorly in practice.Several specific meta-assembly methods have been proposed, including MiTie (Behr et al. 2013), ISP (Tasnim et al. 2015), TACO (Niknafs et al. 2017), StringTie-merge, PsiCLASS (Song et al. 2019), and TransMeta (Yu et al. 2022).The two recent meta-assemblers, PsiCLASS and TransMeta, significantly pushed forward the state-of-the-art with their specific algorithmic advancements, but as far as we know, certain limitations remain.For example, PsiCLASS does not perform as well when sample experiences low coverage, which hinders its broad use for single-cell RNAseq data; TransMeta requires an adequate number of individual graphs in one gene locus to confidently decompose the consensus graph, which may result in reduced accuracy on largely divergent samples.Meta-assembly is not as developed as single-sample assembly, creating a substantial gap that requires attention.
In addition to developing more sensitive algorithms, controlling falsely assembled transcripts is critical, as RNA-seq data are known to be notoriously noisy.Algorithms often produce incorrect transcripts on gene loci with complicated splicing variants, due to the lack of information to resolve ambiguity.While existing tools have developed heuristic criteria to eliminate spurious splice junctions (Yu et al. 2022), the reliance on fixed thresholds and rules often fails to accommodate the diverse scenarios with varying sequencing techniques and dataset distributions.Manually fine-tuning the large number of parameters provided by tools is too laborious and time-consuming to reach an optimum.
We introduce Aletsch, a new meta-assembler crafted for the precise assembly of multiple bulk or single-cell RNA-seq samples.Aletsch stands out by effectively harnessing information from multiple samples while retaining the distinct characteristics of each individual sample.It begins by constructing both individual and combined splice graphs.In these graphs, Aletsch "bridges" pairedend reads, leveraging the combined graph's rich junction information to restore complete fragments.This approach not only refines each sample's individual graph but also guides the graph decomposition with these full fragments.Aletsch extends the idea of "phase-preserving" (Shao and Kingsford 2017) in graph decomposition by factoring in edge supports from additional samples.This comprehensive approach allows Aletsch to generate a reliable set of candidate transcripts.Furthermore, we have integrated a random forest model, equipped with 50 well-designed features, to learn a scoring function that evaluates and ranks the assembled transcripts.With comprehensive experiments we demonstrate the model's exceptional transferability across various chromosomes, datasets, and species.

Materials and methods
Aletsch takes alignments of a set of RNA-seq samples as input, and generates scored transcripts.It starts with constructing individual splice graphs for each sample, and a combined splice graph from all samples.Aletsch then uses an iterative bridging algorithm to "bridge" two mates of pairedend reads.The bridged reads are used to reconstruct more accurate individual and combined graphs, while also providing longer "phasing paths" for graph decomposition.Aletsch employs a new algorithm for decomposing both individual and combined graphs into candidate transcripts, guided by the enhanced phasing paths.Throughout this process, a comprehensive set of features are collected that characterize each candidate transcript.These features are then fed into in a random forest model to predict a score for each candidate transcript.A pipeline of Aletsch is illustrated in Fig. 1.

Construction and clustering of splice graphs
Splice graph is instrumental for studying alternative splicing and transcript assembly.A splice graph, formally written as G ¼ ðV; E; wÞ, is a weighted acyclic graph, where V denotes exons or partial exons, E signifies splicing junctions, and wð�Þ represents coverage.Aletsch loads read alignments in each sample, and builds an individual splice graph for each gene locus.Specifically, splicing junctions are first extracted from the read alignments to form the edges E. These splicing positions also partition the reference genome, and consequently aid in identifying (partial) exons, which form the vertices V.
The weight function wð�Þ assigns weights to both vertices and edges in G.The weight of a vertex v, wðvÞ, is computed as the average coverage of the corresponding exon, and the weight of an edge e ¼ ðu; vÞ, i.e. wðeÞ or wðu; vÞ, is determined by the count of reads spanning the junction from exon u to exon v.
To make use of the shared signals across samples, Aletsch performs clustering over individual splice graphs, aiming to group splice graphs from different samples that correspond to the same gene locus.Aletsch computes the similarity of each pair of individual graphs G i and G j , defined as jS i \ S j j=jS i [ S j j, where S i and S j are the set of splicing positions in G i and G j , respectively.Aletsch then conducts singlelinkage clustering starting from the highest similarity: a cluster stops growing when its size reaches a threshold (a parameter, which is often set as the number of samples).
Aletsch handles each cluster separately and independently.A combined splice graph, denoted as G M , is constructed to represent the collective splicing and coverage information in a cluster.G M is constructed in the same way as individual graphs, but incorporates all reads used to build individual graphs in the cluster.
For either individual or combined graphs, vertices with indegree of 0 are called starting vertices, represented as V s � V, and those without-degree of 0 are called ending vertices, represented as V t � V. We add a source vertex s, connecting to all starting vertices u 2 V s with e ¼ ðs; uÞ of weight wðs; uÞ ¼ P ðu;vÞ2E ðwðu; vÞÞ.Similarly, all ending vertices v 2 V t are connected to a sink vertex t, with edge weight wðv; tÞ ¼ P ðu;vÞ2E ðwðu; vÞÞ.Within these notations, transcripts are equivalent to s-t paths in the splice graphs.

Bridging
We focus on the assembly of paired-end bulk or single-cell RNA-seq data in this work.Bridging is the process of inferring full-length fragments, essentially reconstructing the unsequenced portion of paired-end reads.Successful bridging provides more accurate structures and weights of splice graphs, while also offering longer phasing paths for resolving complex splicing variants, both of which are crucial for improving transcript assembly.In previous work, we developed an algorithm for bridging paired-end RNA-seq reads in the context of assembling individual RNA-seq samples (Zhang et al. 2022).Specifically, we formulated bridging as seeking an optimal path in the underlying (individual) splice graph that connects the two ends of paired-end reads.An efficient dynamic programming algorithm was designed for this purpose.We demonstrated that this bridging technique significantly improves the assembly of both linear (Zhang et al. 2022) and circular (Zahin et al. 2024) transcripts.
Using the above algorithm as a building block, here we develop a bridging framework for meta-assembly.First, pairedend reads are bridged independently for separate samples, using respective individual splice graphs.We then shift focus to reads that fail in this initial bridging.Such failures often indicate the incompleteness of individual splice graphs, typically due to missed junctions in corresponding samples (e.g.R 3 in Fig. 2).We therefore bridge these unbridged reads w.r.t. the combined splice graph, i.e. finding an optimal path in the combined splice graph that connects the two ends, using above algorithm.This stage of bridging will be successful if the missed junctions are sequenced in some other sample and appear in the combined graph.The bridged reads will then be used to refine the individual graphs, including restoring missed edges and correcting edge and vertex weights.See Fig. 2.
This new bridging framework represents one of the innovations of Aletsch that leverages information from multiple samples to boost meta-assembly.It is worth mentioning that Aletsch acknowledges that individual samples express differently: it prioritizes bridging within individual samples whenever feasible.Only when bridging fails, which strongly indicates incompleteness of individual splice graphs, does Aletsch "borrow" information from other samples.This cautious strategy keeps Aletsch from assembling transcript "artifacts" that mix junctions from different samples, striking a balance between maintaining sample-specific integrity and enhancing assembly accuracy.
We represent bridged fragments as paths in the underlying splice graphs, called phasing paths.These paths, along with the refined splice graph, are then piped into a subsequent graph-decomposition algorithm to obtain candidate transcripts.Phasing paths play a crucial role in guiding the graphdecomposition, particularly in complex branching scenarios.

Edge supports
We first construct a data structure that characterizes and stores the shared information across samples.This structure is integrated into the subsequent graph-decomposition algorithm.For each edge e in either an individual splice graph or the combined splice graph, we construct a size-N vector M e , where M e ½j� quantifies the "support" from G j to edge e, 1 ≤ j ≤ N, where N is the number of individual splice graphs in a cluster.Initially, M e ½j� ¼ 0 for all 1 ≤ j ≤ N. Below we show the calculation of M e ½j� based on the type of edge e.Let v be a vertex.We use LðvÞ and RðvÞ to represent the left and right genomic coordinate of the (partial) exon corresponding to v. Similarly, for an edge e ¼ ðu; vÞ, we use LðeÞ and RðeÞ to represent the left and right positions of e, i.e.LðeÞ :¼ RðuÞ and RðeÞ :¼ LðvÞ.We categorize edges into three types: junction edge, adjacent edge, and starting edge, based on their structures.Aletsch i309

Adjacent edge
We say edge e ¼ ðu; vÞ is an adjacent edge if u is adjacent to v in its genomic positions, i.e.LðeÞ þ 1 ¼ RðeÞ.An adjacent edge e 0 in G j supports e if Lðe 0 Þ ¼ LðeÞ and Rðe 0 Þ ¼ RðeÞ.If such e 0 exists, we define M e ½j� ¼ wðe 0 Þ.It is also possible that a vertex supports an adjacent edge.We define a vertex v 0 in G j supports an adjacent edge In this case, we assign M e ½j� ¼ wðv 0 Þ.As an example, v 0 3 supports e 2 in Fig. 3.

Starting/ending edge
We say edge e ¼ ðu; vÞ is a starting edge if u ¼ s.The support mechanism for starting edges is more complex.To calculate M e ½j�, we locate position RðvÞ in G j .If position RðvÞ locates within vertex v 0 in G j and there exists a starting edge e 0 ¼ ðs 0 ; v 0 Þ 2 G j , where s 0 is the source vertex of G j , then, we say that starting edge e 0 supports e and in this case we assign M e ½j� ¼ wðe 0 Þ.In the absence of such a starting edge e 0 , we continue to examine if there exists an incoming adjacent edge ðu 0 ; v 0 Þ and a starting edge e 00 ¼ ðs 0 ; u 0 Þ, and if both exist, we also say e 00 supports e and assign M e ½j� ¼ wðe 00 Þ.We impose a limitation on such left-extension that RðvÞ−Rðu 0 Þ should not exceed a threshold (200 base pairs by default).In Fig. 3, starting edge ðs 0 ; v 0 1 Þ supports edge ðs; v 1 Þ.The calculation of edge support for any ending edge e mirrors that of starting edges but extending to the right.We say edge e ¼ ðu; vÞ is an ending edge if v ¼ t.In Fig. 3, no ending edges in G j supports edge ðv 4 ; tÞ.

Graph decomposition
Aletsch implements a novel graph-decomposition algorithm that inputs a weighted splice graph G ¼ ðV; E; wÞ, edge supports M e for each e 2 E, and a set of phasing paths H.It produces a set of s-t paths P termed as candidate transcripts.This algorithm is consistently applied to each individual splice graph and the combined splice graph.The framework of this algorithm is to iteratively decompose vertices of G until it is reduced to a collection of parallel s-t paths.We follow two principles in decomposing a single vertex.The first one is phase-preserving: any phasing path h 2 H must appear entirely in at least one s-t path p 2 P.This principle is based on the understanding that phasing paths, derived from RNA fragments, should be segments of some expressed transcripts.We achieve phase-preserving by not breaking any phasing paths during decomposition.In the absence of phasing paths, we use the supporting information (i.e.M e ) to guide the decomposition-this is the second principle.
We design two subroutines for decomposing a single vertex, depending on its type.A trivial vertex in V n fs; tg is defined as having either in-degree of 1 or out-degree of 1.The procedure for decomposing a trivial vertex v is straightforward-see Fig. 4a.For trivial vertex v, given an in-edge e i and an out-edge e o , we create a new edge ðe i ; e o Þ after decomposing v.The edge support for this new edge is calculated as: M ei;eo ½j� ¼ minfM ei ½j�; M eo ½j�g, for every 1 ≤ j ≤ N.
To decompose a non-trivial vertex v, we construct a bipartite graph where V i and V o correspond to the in-edges and out-edges of v in G, respectively.If there exists a phasing path h that threads through e i , v, and e o in G, we add an undirected edge ðe i ; e o Þ into the bipartite graph G v .The edge supports for ðe i ; e o Þ is calculated as M ei;eo ½j� :¼ minfM ei ½j�; M eo ½j�g for each 1 ≤ j ≤ N.There might be "isolated" vertices (degree of 0) in G v , i.e. they are not threaded by any phasing paths, as shown in Fig. 4b.We employ edge supports to add edges for them in G v .Specifically, for an isolated vertex e (assuming e 2 V i without loss of generality), we calculate e � o :¼ argmax eo2Vo ð P 1 ≤ j ≤ N M e;eo ½j�Þ, and then, add an edge ðe; e � o Þ to G v .This procedure is applied to all isolated vertices in G v .
The decomposition of G follows the structure of G v : every edge ðe i ; e o Þ in G v becomes a new edge ðe i ; e o Þ in G after decomposing v. Additionally, the decomposed vertex v is recorded on these new edges with an auxiliary data structure, allowing for the reconstruction of the vertices list for the resulting s-t paths.
The above algorithm extends the graph-decomposition algorithm developed in Shao and Kingsford (2017), from which Aletsch adapts its iterative framework and the "phase-preserving" principle.We also reuse the weight-updating procedures in Scallop to calculate the weight wðeÞ for each newly created edge e; at the end of the decomposition, when G becomes a set of parallel s-t edges, this weight represents the inferred abundance of the corresponding candidate transcript.The innovation of this algorithm lies in its utilization of signals of multiple samples (i.e.edge supports) to guide the decomposition.Aletsch gives higher priority to phasing information associated with each individual graph, and only when such information is not available (i.e.isolated vertices), the information from other samples are employed.This strategy again reduces the chance for Aletsch to assemble transcript artifacts.
It is worth noting that, for each s-t edge e at the end, if M e ½j� > 0, it suggests that the s-t path corresponding to e also exists as a path in the splice graph of the j-th sample (this s-t path may or may not be assembled in the j-th sample, though).
Hence, both jfjjM e ½j� > 0; 1 ≤ j ≤ Ngj and P 1 ≤ j ≤ N M e ½j� are informative features that quantify how much other samples support this candidate transcript e.These features, together with others such as wðeÞ, will be used in the subsequent machine-learning model for scoring.

Features
It is of great interest to assign a confidence score to assembled transcript, indicating how likely it is to be correct.Traditionally, the inferred abundance serves this purpose, given its observed positive correlation with correctness.We explore a more precise scoring function through machine learning.Rooted in our deep understanding and extensive experience on transcript assembly, we have engineered 50 features to comprehensively characterize an assembled candidate transcript, ranging from its decomposition to its interaction with other transcripts and graphs.We categorize them below; a detailed description with intuitions is in Supplementary Note S1.
1) Abundance Features.This category includes the inferred abundance, number of supporting samples and their total weights, "bottleneck" weights and sample supports, and so forth.2) Graph Structural Features.This includes the number of vertices and junctions in the related individual splice graph and the combined splice graph, and so forth.3) Boundary Features.One major challenge in transcript assembly is the accurate prediction of starting and ending sites.We use features such as the supports of a boundary from other samples.They are particularly useful for single-cell RNA-seq data, where transcripts are more likely to be partially sequenced, which leads to false boundaries.4) Intron Features.The differentiation of transcripts with intron retentions from contaminated reads in intron regions presents another challenge.Features including the "competitiveness" of intron coverage and surrounding junctions are informative to distinguish them.

Scoring and meta-assembly
We trained a random forest to score candidate transcripts, each represented by a length-50 feature vector.The model is configured with 100 estimators and a maximum depth of 20, performing binary classification to yield a probability score between 0 and 1 for each candidate transcript.
It is common for "identical candidate transcripts" to be assembled from different samples, or from an individual graph and the combined graph.Here, we define two candidate transcripts as identical if they share the same intron-chain, a definition widely accepted.We aggregate identical candidate transcripts into a single "meta-transcript."The score of a meta-transcript is calculated as the average score of its consisting candidate transcripts (Fig. 1).We recommend discarding meta-transcripts with a score below 0.2.Aletsch sets a default threshold of 0.5, which users can adjust to balance recall and precision.Aletsch outputs the set of metatranscripts with their scores, consistent with other meta-assemblers.

Experimental setup
We compare Aletsch with two existing meta-assemblers, TransMeta (v.1.0)and PsiCLASS (v1.0.3), and two additional meta-assembly pipelines based on single-sample assemblers, StringTie2-merge (v2.2.1) which we refer to as ST2Merge for short, and Scallop2 (v1.1.2) combined with Taco (v0.7.3), referred to as SC2TACO.All methods take a set of aligned RNA-seq samples or cells as input and produce a set of (scored) meta-transcripts.When the RNA-seq data is non-synthetic which we often do not know the true expressed transcripts, the commonly used ground-truth for evaluation is reference annotation.A multi-exon meta-transcript is considered as a "match" if its intron-chain exactly matches with a transcript in the reference annotation.Two metrics can then be calculated with tool GffCompare (Pertea and Pertea 2020): precision, defined as the ratio between the number of matched and the total number of predicted meta-transcripts, and recall, defined as the ratio of the number of matched meta-transcripts to the total number of transcripts in the annotation.To conclude when one method has a higher recall but lower precision, or vice versa, it is common to draw the precision-recall curves (PRC) with a varying parameter.The parameter used for this purpose is the inferred confidence score for Aletsch, and the predicted abundance for other methods (Supplementary Note S2).In this way, the performance of different methods over a broader range of parameters can also be compared.Experimental results shown in the main text are all conducted using the Ensembl annotation.Results using the RefSeq annotation are available in Supplementary Fig. S7 and Supplementary Tables S1 and S2. (before): (after): (before): (after):  Our experiments were conducted on eight real datasets, including both bulk and single-cell RNA-seq data, detailed in Table 1 and Supplementary Tables S6 and S7.The comparison of runtime and memory is given in Supplementary Note S3 and Supplementary Tables S4 and S5.Since Aletsch has a machinelearning module for scoring, we pay close attention to the model's transferability to different datasets or even to different species.Therefore, we partition the eight datasets into three groups: the models are trained using G1, including four human datasets; G2 serves to show the models' performance on different human datasets, and G3 is used to show its performance on a new species.Note that G1 includes both bulk and single-cell datasets.Integrating both is challenging as they exhibit different properties such as sample size and read coverage.Aletsch addressed this challenge, thanks to its comprehensive featureengineering and normalizations on particular features.
We train two models Aletsch-Chr1 and Aletsch-ChrAll.To train Aletsch-Chr1, we extract the RNA-seq reads from G1 datasets aligned to chromosome 1.Aletsch is run on these reads to produce candidate transcripts, each equipped with a feature vector of size 50.These candidate transcripts are then labeled, using a given annotation, where a candidate transcript gets labeled as "1" if it matches the intron chain of an existing transcript in the annotation, and "0" otherwise.These training samples, each consisting of the feature vector plus the label, are then used to train a random forest model.Aletsch-ChrAll is trained similarly to Aletsch-Chr1, except that all reads in the G1 datasets are used.
We note that Aletsch-Chr1 "sees" the transcripts of chromosome 1 in the annotation, while Aletsch-ChrAll sees all transcripts in the annotation.Since the same annotation also serves as the ground-truth to evaluate the accuracy of the trained model, there is potential risk of "data-leaking."We emphasize that the features Aletsch uses are the quantification of either transcripts' abundances or structures, without revealing any genomic coordinates; hence, data-leaking is impossible.But still, we carefully designed the experiments to prove that Aletsch learns the intrinsic characterizations of true transcripts.To do so, we test the accuracy of Aletsch-Chr1 on other chromosomes (excluding chromosome 1), as annotated transcripts for these chromosomes are never seen by Aletsch-Chr1.Below, we will show that Aletsch-Chr1 substantially outperforms existing meta-assemblers on other chromosomes and that the accuracy of Aletsch-Chr1 matches the accuracy of Aletsch-ChrAll.These jointly testify that no data leaks occur in the training process and both models can be safely used.

Results on other chromosomes of G1
We first compare Aletsch-Chr1 with other meta-assembly methods on G1, but for chromosomes other than the first (termed chr2þ).Their PRCs are given in Fig. 5.Note that Aletsch-Chr1's curve is notably higher on all four datasets, showing that Aletsch outperforms other meta-assemblers.
To quantify the improvement, we measure the area under the PRC.Since the ranges of precision or recall of different methods vary, we compare Aletsch with each of the other methods by locating the shared range, either by recall or by precision, and calculate the partial area under the PRC (pAUC).The detailed results are given in Tables 2 and 3. We observe that the improvement of Aletsch over other methods is substantial.On G1 datasets, the improvement over the second-best method TransMeta, ranges from 9.2% to 31.6% when the pAUC is constrained by recall.The improvement is even more pronounced when the pAUC is constrained by precision (39.6%-50.6%);this is because Aletsch is able to assemble more matched transcript at the cost of the same amount of precision.
The success of Aletsch-Chr1 suggests that the model learned from one chromosome can be transferred to other chromosomes, proving that the model effectively captures the intrinsic features of true transcripts.Its high accuracy on both bulk and single-cell datasets also illustrates the feasibility of developing a versatile model for multiple types of datasets.
Many cells in the SC-H2 dataset have low read coverage, and it is expected that some junctions in these cells are not sequenced.This makes methods based on single-sample assemblers almost impossible to assemble correctly.Aletsch is able to recover the missed junctions from other cells/samples.Aletsch also decomposes the combined graph, another mechanism to rescue such transcripts.Note that whether a transcript gets assembled in both the individual and combined graphs will be recorded as features which will eventually be reflected in its confidence score.Aletsch is therefore still accurate in assembling single-cell datasets with low read coverage.For example, the improvement of Aletsch over ST2Merge and SC2TACO on this dataset is 123.9% and 220.5% when measured with pAUC constrained by recall.

Results on G2
We investigate the performance of Aletsch on G2 that are not touched in the training.We separate chromosome 1 and other chromosomes.Figure 6 gives the PRCs of Aletsch-Chr1, Aletsch-ChrAll, and other methods.Aletsch substantially outperforms other methods on all datasets, demonstrating its transferability to different datasets.Another two key observations are that Aletsch-Chr1 shows no obvious difference between chr1 and the other chromosomes, and that Aletsch-Chr1 and Aletsch-ChrAll exhibit nearly identical performance.This consistency firmly proves that Aletsch has learned the intrinsic characteristics to select true transcripts.Otherwise, one would observe a significant performance drop on other chromosomes and a significant performance boost when using Aletsch-ChrAll.

Results on G3
We compare the performance of different meta-assemblers on G3, consisting of two mouse datasets.We are particularly interested in if a model learned from human can achieve similar improvement on a different species.The PRCs are given in Fig. 7, in which Aletsch-ChrAll is used.Again Aletsch achieves superior accuracy consistently, indicating an remarkable transferability across different species.

Results on simulations
One of the major challenges for meta-assembly is that the input samples/cells may or may not express similar sets of

Aletsch i313
transcripts.We investigate the performance of different methods at different levels of "similarity" using simulations.We simulated three datasets using Polyester (Frazee et al. 2015), namely Polyester-Sim90, Polyester-Sim50, and Polyester-Sim10, each consisting of 30 samples.In simulation, we first randomly picked 101 453 multi-exon transcripts across 9985 gene loci from the Ensembl annotation.The ground-truth of each sample is created by randomly picking p% transcripts from this set, where p ¼ 10; 50; 90, corresponding to the three datasets.Therefore, the ground-truth transcripts in the Polyester-Sim90 samples will be more likely to be shared.Aletsch was trained on a separate simulated dataset with p ¼ 50 from a ground-truth that does not share any gene used in simulating the 3 testing datasets.The PRCs are given in Fig. 8. TransMeta and PsiCLASS performed quite differently on these simulated datasets compared to real datasets.Aletsch, however, adapted to new distributions and outperformed the others.Specifically, Aletsch achieved pAUC, constrained by recall, 57.7%-78.3%higher than TransMeta, 10.2%-17.4% higher than PsiCLASS, 17.5%-84.6%higher than ST2Merge, 29.9%-42.8%higher than SC2TACO.
As sample similarity decreased, ST2Merge's performance became closer to that of PsiCLASS, while TransMeta's results deteriorated.This is likely attributed to the difficulty in achieving "consensus" in TransMeta or making confident "voting" decisions in PsiCLASS under low similarity.Aletsch, in contrast, recognizes the diversity in the expressed transcript across samples and avoids enforcing consensus.This strategy allows Aletsch to achieve much improved accuracy in cases like Polyester-Sim10, where samples show high variability.

Discussion
We introduced Aletsch, a new meta-assembler for bulk and single-cell RNA-seq data.Aletsch achieved superior accuracy by fully taking advantage of shared information across multiple samples.It's flexible, allowing easy retraining for new sequencing protocols or tissue-specific datasets.For underexplored species lacking annotations, we recommend training on phylogenetically similar species, thanks to Aletsch's proven transferability across datasets and species.
Aletsch is currently tailored for short-read bulk and singlecell RNA-seq data.Recent advancements in long-read RNAseq technologies offer improved capabilities for detecting full-length isoforms.Extending Aletsch to concurrently assemble both short-reads and long-reads represents a formidable task that we intend to explore.Despite the emergence of long-read technologies, Aletsch remains highly impactful, as short-read RNA-seq data continue to serve as the de facto standard and are widely utilized.Moreover, a large volume of short-read RNA-seq data has been deposited, making Aletsch applicable for enhanced analysis and the potential discovery of novel biological insights.On the other hand, identifying full-length isoforms from long-read RNA-seq data remains a challenge, particularly in cases where annotation is unavailable, primarily due to difficulties in accurately determining splicing junctions (Pardo-Palacios et al. 2021).This Aletsch-Chr1 was used for G1 and G2 datasets; and Aletsch-ChrAll for G3.PsiCLASS experienced segment-fault on SC-H2.limitation is one reason why we did not opt for long-read datasets as the ground truth for evaluation.Meta-assembly that harnesses the strengths of both short-read and long-read RNA-seq data, which we plan to extend Aletsch for, holds promise for significantly advancing isoform reconstruction.Currently, the score for a meta-transcript is calculated by averaging the scores of its corresponding identical transcripts.This approach proves effective for our scenario, as we observe that the scores for the candidate identical transcripts tend to cluster within a narrow range.This phenomenon is mainly because of the shared "meta-coverage" across all identical transcripts.We recognize that directly learning a score for each meta-transcript is an intriguing machinelearning task and worth independent exploration.
Besides a random forest, we also experimented with training a sequence model, specifically an LSTM, for path embedding.While LSTMs excel in processing time series data, our efforts to use them for binary classification of transcripts (determining the correctness of a transcript) were not entirely successful: the model struggled to capture the relationship between adjacent edges in an s-t path.Although our current random forest model performed well on existing features, we plan to investigate advanced embedding techniques to hopefully replace hand-crafted features.

Figure 1 .
Figure 1.Workflow of Aletsch.(a) Constructing individual and combined splice graphs.Dashed green lines indicates junctions inferred from clipped reads.(b) Bridging paired-end reads to create enhanced phasing paths, shown as dashed green blocks.(c) Refining individual splice graphs.(d) Decomposing refined individual splice graphs and the combined graph, guided by enhanced phasing paths.(e) Grouping identical intron-chains (e.g.τ 3 and τ 4 ) and scoring candidate transcripts with a random forest model to generate the final meta-transcripts.

Figure 2 .
Figure 2. Bridging in Aletsch: Reads R 1 and R 2 are successfully bridged in graph G 1 (dashed green blocks).Read R 3 fails to bridge in G 1 due to the absence of a path from v 2 to v 4 .But it can be bridged in combined graph G M (orange), restoring a missed junction, and updating edge weights in G 1 .Bridged reads in G 1 form phasing path set H 1 .

Figure 3 .
Figure 3. Types of edge support from splice graph G j to G i : ‹ Junction edge.› Adjacent edge.fi Starting/ending edge.

Figure 4 .
Figure 4. Examples of vertex decomposition.Red dashed arrow lines represent the phasing paths.In (b), each M e has three entries, representing edge support from three samples in the gene locus.Isolated vertices in the bipartite graph G v are resolved from left to right.For the isolated vertex e 3 on the left side, e 5 ¼ argmax eo 2fe4;e5;e6 g ð P 1≤j≤3 M e3;eo ½j�Þ.Accordingly, we add an edge between e 3 and e 5 in G v , shown as a black dashed line.

Figure 5 .
Figure 5.Comparison of PRCs of different assemblers on G1.Circled points indicate each tool's default settings, while starred points denote the maximum recall achieved with the coverage filter disabled.

Figure 6 .
Figure 6.Comparison of PRCs of different assemblers on G2.Circled points show default settings; starred points indicate filtering disabled.

Table 1 .
Summary of real datasets.